1 Introduction

Non-trivial consistency problems, e.g. in file systems, collaborative environments, and databases,
are the major challenge of large-scale systems. Recently some architectures have emerged
to scale file systems up to thousands of nodes |12[ \TE\ [3j , but no practical solution exists for 
database systems. 

At the cluster level protocols based on group communication primitives [H 111! 116] are 
the most promising solutions to replicate database systems [22]. In this article we extend
the group communication approach to large-scale systems.

Highlights of our protocol: 

. Replicas do not re-execute transactions, but apply update values only. 

. We do not compute a total order over operations. Instead, transactions are partially
ordered. Two transactions are ordered only over the data where they conflict.

. For every transaction T we maintain the graph of T's dependencies. T commits locally 
when T is transitively closed in this graph. 

The outline of the paper is the following. Section 2 introduces our model and assumptions.
Section 3 presents our algorithm. We conclude in Section 4 after a survey of related
work. An appendix follows containing a proof of correctness.

2 System model and assumptions 

We consider a finite set of asynchronous processes, or sites, Π, forming a distributed system.
Sites may fail by crashing, and links between sites are asynchronous but reliable. Each site
holds a database that we model as some finite set of data items. We leave the granularity of
a data item unspecified. In the relational model, it can be a column, a table, or even a
whole relational database. Given a data item x, the replicas of x, noted replicas(x), are the
subset of Π whose databases contain x.

We base our algorithm on the three following primitives:

. Uniform Reliable Multicast takes as input a unique message m and a single group of
sites g ⊆ Π. Uniform Reliable Multicast consists of the two primitives R-multicast(m)
and R-deliver(m). With Uniform Reliable Multicast, all sites in g have the following
guarantees:

— Uniform Integrity: for every message m, every site in g performs R-deliver(m)
at most once, and only if some site performed R-multicast(m) previously.

— Validity: if a correct site in g performs R-multicast(m) then it eventually performs
R-deliver(m).

1 Our taxonomy comes from \E\. 
Table 1: Lock conflict table



— Uniform Agreement: if a site in g performs R-deliver(m), then every correct site
in g eventually performs R-deliver(m).

Uniform Reliable Multicast is solvable in an asynchronous system with reliable links
and crash-prone sites.

. Uniform Total Order Multicast takes as input a unique message m and a single group of 
sites g. Uniform Total Order Multicast consists of the two primitives TO-multicast(m) 
and TO-deliver(m). This communication primitive ensures Uniform Integrity, Validity, 
Uniform Agreement and Uniform Total Order in g: 

— Uniform Total Order: if a site in g performs TO-deliver(m) and TO-deliver(m') 
in this order, then every site in g that performs TO-deliver(m') has performed 
previously TO-deliver(m). 

. Eventual Weak Leader Service: given a group of sites g, a site i ∈ g may call function
WLeader(g). WLeader(g) returns a weak leader of g:

— WLeader(g) ∈ g.

— Let ρ be a run of Π such that a non-empty subset c of g is correct in ρ. There exists
a site i ∈ c and a time t such that for any call of WLeader(g) on i after t,
WLeader(g) returns i.

This service is strictly weaker than the classical eventual leader service Ω [18], since
we do not require that every correct site eventually outputs the same leader. An
algorithm that returns to every process itself trivially implements the Eventual Weak
Leader Service.
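
The trivial implementation mentioned above can be sketched as follows. This is only an illustration: the function name `wleader` and the use of strings as site identifiers are assumptions of the sketch, not part of the protocol description.

```python
from typing import FrozenSet

Site = str  # hypothetical site identifier


def wleader(caller: Site, group: FrozenSet[Site]) -> Site:
    """Trivial Eventual Weak Leader Service: every site elects itself.

    The caller belongs to the group, so the result is in the group, and
    repeated calls on the same site always return that same site.
    """
    if caller not in group:
        raise ValueError("only members of g may call WLeader(g)")
    return caller


g = frozenset({"s1", "s2", "s3"})
leaders = {site: wleader(site, g) for site in g}
```

Different sites obtain different leaders here, which the weak service allows; the classical eventual leader service would not.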

In the following we make two assumptions: during any run, (A1) for any data item x,
at least one replica of x is correct, and (A2) Uniform Total Order Multicast is solvable in
replicas(x).

2.1 Operations and locks 

Clients of the system (not modeled) access data items using read and write operations.
Each operation is uniquely identified, and accesses a single data item. A read operation is
a singleton: the data item read; a write operation is a couple: the data item written, and
the update value.

When an operation accesses a data item on a site, it takes a lock. We consider the
three following types of locks: read lock (R), write lock (W), and intention to write lock
(IW). Table 1 illustrates how locks conflict with each other; when an operation requests a
lock to access a data item, if the lock is already taken and cannot be shared, the request is
enqueued in a FIFO queue. In Table 1, 0 means that the request is enqueued, and 1 that
the lock is granted.

Given an operation o, we note:

. item(o), the data item operation o accesses,

. isRead(o) (resp. isWrite(o)) a boolean indicating whether o is a read (resp. a write),

. and replicas(o) = replicas(item(o)).

We say that two operations o and o' conflict if they access the same data item and one of
them is a write:

    conflict(o, o') ≜ item(o) = item(o') ∧ (isWrite(o) ∨ isWrite(o'))
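
As a minimal sketch of this definition, the conflict test can be written as below. The `Operation` record and its field names are hypothetical and chosen only for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    op_id: str      # unique identifier of the operation
    item: str       # the single data item it accesses
    is_write: bool  # True for a write, False for a read


def conflict(o1: Operation, o2: Operation) -> bool:
    """Two operations conflict if they access the same item and one writes."""
    return o1.item == o2.item and (o1.is_write or o2.is_write)


r_x = Operation("r1", "x", False)
w_x = Operation("w1", "x", True)
r_x2 = Operation("r2", "x", False)
w_y = Operation("w2", "y", True)
```

Two reads of the same item do not conflict, and neither do two writes on different items.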



2.2 Transactions 

Clients group their operations into transactions. A transaction is a uniquely identified set
of read and write operations. Given a transaction T,

. for any operation o ∈ T, function trans(o) returns T,

. ro(T) (respectively wo(T)) is the subset of read (resp. write) operations,

. item(T) is the set of data items transaction T accesses: item(T) = ⋃_{o ∈ T} item(o),

. and replicas(T) = replicas(item(T)).

Once a site i grants a lock to a transaction T, T holds it until i commits T, i aborts T,
or we explicitly say that this lock is released.
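
The following sketch illustrates these notations on a toy placement. The `PLACEMENT` map, the site names and the encoding of an operation as an (item, is_write) pair are assumptions made for the example only.

```python
from typing import Dict, FrozenSet, Set, Tuple

# In this sketch an operation is an (item, is_write) pair.
Op = Tuple[str, bool]

# Hypothetical placement: replicas(x) for each data item x.
PLACEMENT: Dict[str, FrozenSet[str]] = {
    "x": frozenset({"s1", "s2"}),
    "y": frozenset({"s2", "s3"}),
}


def items(transaction: Set[Op]) -> Set[str]:
    """item(T): the union of item(o) over the operations of T."""
    return {item for item, _ in transaction}


def replicas(transaction: Set[Op]) -> Set[str]:
    """replicas(T): every site replicating an item accessed by T."""
    sites: Set[str] = set()
    for item in items(transaction):
        sites |= PLACEMENT[item]
    return sites


def ro(transaction: Set[Op]) -> Set[Op]:
    return {op for op in transaction if not op[1]}


def wo(transaction: Set[Op]) -> Set[Op]:
    return {op for op in transaction if op[1]}


T = {("x", False), ("y", True)}
```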



3 The algorithm

As replicas execute transactions, they create precedence constraints between conflicting
transactions. Serializability theory tells us that this relation must be acyclic [2].

One solution to this problem is, given a transaction T, (i) to execute T on every replica of
T, (ii) to compute the transitive closure of the precedence constraints linking T to concurrent
conflicting transactions, and (iii) if a cycle appears, to abort at least one of the transactions
involved in this cycle.

Unfortunately as the number of replicas grows, sites may crash, and the network may
experience congestion. Consequently to compute (ii) the replicas of T need to agree upon
the set of concurrent transactions accessing item(T).

Our solution is to use a TO-multicast protocol per data item.



3.1 Overview 

To ease our presentation we consider in the following that a transaction executes initially on
a single site. Section 3.9 generalizes our approach to the case where a transaction initially
executes on more than one site. We structure our algorithm in five phases:

. In the initial execution phase, a transaction T executes at some site i.

. In the submission phase, i transmits T to replicas(T).

. In the certification phase, a site j aborts T if T has read an outdated value. If T
is not aborted, j computes all the precedence constraints linking T to transactions
previously received at site j.

. In the closure phase, j completes its knowledge about precedence constraints linking
T to other transactions.

. Once T is closed at site j, the commitment phase takes place: j decides locally whether
to commit or abort T. This decision is deterministic, and identical on every site
replicating a data item written by T.



3.2 Initial execution phase 

A site i executes a transaction T coming from a client according to the two-phase locking
rule [2], but without applying write operations (footnote 2). When T reaches a commit statement,
it is not committed; instead i releases T's read locks, converts T's write locks into intention
to write locks, computes T's update values, and then proceeds to the submission phase.
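
The lock handling at the commit statement can be sketched as below. The lock table layout (a dictionary keyed by transaction and data item) and the function name are assumptions of this sketch; lock queues and update values are left out.

```python
from typing import Dict, Tuple

# Hypothetical lock table: (transaction id, data item) -> lock mode.
LockTable = Dict[Tuple[str, str], str]


def on_commit_statement(locks: LockTable, txn: str) -> LockTable:
    """Return the lock table after txn reaches its commit statement.

    Read locks of txn are released, its write locks become
    intention-to-write locks, and other transactions are untouched.
    """
    updated: LockTable = {}
    for (owner, item), mode in locks.items():
        if owner != txn:
            updated[(owner, item)] = mode   # another transaction's lock
        elif mode in ("W", "IW"):
            updated[(owner, item)] = "IW"   # W is converted to IW
        # mode == "R": the read lock is released, so it is dropped
    return updated


locks = {("T1", "x"): "R", ("T1", "y"): "W", ("T2", "z"): "R"}
after = on_commit_statement(locks, "T1")
```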



3.3 Submission phase 

In this phase i R-multicasts T to replicas(T). When a site j receives T, j marks all T's
operations as pending using variable pending. Then if there exists an operation o ∈ pending
such that j = WLeader(replicas(o)), j TO-multicasts o to replicas(o) (footnote 3).



2 If T writes a data item x then reads it, we suppose some internals to ensure that T sees a consistent 
value. 

3 If instead of this procedure, i TO-multicasts all the operations, then the system blocks if i crashes. We 
use a weak leader and a reliable multicast to preserve liveness. 
3.4 Certification phase

When a site i TO-delivers an operation o for the first time (footnote 4), i removes o from
pending, then i certifies o.



To certify o, i considers any preceding write operation that conflicts with o. We say that
a conflicting operation o' precedes o at site i, noted o' →_i o, if i TO-delivers o' and then
TO-delivers o:

    o' →_i o ≜ conflict(o', o) ∧ TO-deliver_i(o') ≺ TO-deliver_i(o)

where, given two events e and e', e ≺ e' is the relation "e happens-before e'", and
TO-deliver_i(o') is the event "site i TO-delivers operation o'".

If o is a read, we check that o did not read an outdated value. This happens when o executed
concurrently with a conflicting write operation o' that is now committed. Let committed_i
be the set of transactions committed at site i; the read operation o aborts if there exists
an operation o' such that o' →_i o ∧ trans(o') ∥ trans(o) ∧ trans(o') ∈ committed_i, where
trans(o') ∥ trans(o) means that the transactions trans(o') and trans(o) were executed
concurrently during the initial execution phase.

If now o is a write, i gives an IW lock to o: function forceWriteLock(o). If an operation
o' holds a conflicting IW lock, o and o' share the lock (see Table 1); otherwise it means that
trans(o') is still executing at site i, and function forceWriteLock(o) aborts it (footnote 5).
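
The certification test for a read can be sketched as follows. All names are hypothetical: the list `delivered_before` stands for the operations TO-delivered at the site before the read, and `concurrent` holds the unordered pairs of transactions that ran concurrently during initial execution.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class Op:
    txn: str        # trans(o)
    item: str       # item(o)
    is_write: bool  # isWrite(o)


def read_must_abort(read: Op,
                    delivered_before: List[Op],
                    committed: Set[str],
                    concurrent: Set[FrozenSet[str]]) -> bool:
    """True if some conflicting write delivered earlier belongs to a
    committed transaction that was concurrent with the reader."""
    for prev in delivered_before:
        precedes = prev.is_write and prev.item == read.item
        if (precedes
                and frozenset({prev.txn, read.txn}) in concurrent
                and prev.txn in committed):
            return True
    return False


r = Op("T2", "x", False)
history = [Op("T1", "x", True), Op("T3", "y", True)]
pairs = {frozenset({"T1", "T2"})}
```

With these values, the read of T2 aborts only once T1 is committed at the site.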

3.5 Precedence graph 

Our algorithm decides to commit or abort transactions according to a precedence graph.
A precedence graph G is a directed graph where each node is a transaction T, and each
directed edge T→T' models a precedence constraint between an operation of T and a write
operation of T':

    T→T' ≜ ∃(o, o') ∈ T × T', ∃i ∈ Π, o →_i o'

A precedence graph also contains, for each vertex T, a flag indicating whether T is aborted
or not: isAborted(T, G), and the subset of T's operations: op(T, G), which contribute to the
relations linking T to other transactions in G.

Given a precedence graph G, we note G.V its vertex set, and G.E its edge set. Let G
and G' be two precedence graphs; the union of G and G', G ∪ G', is such that:

. (G ∪ G').V = G.V ∪ G'.V,

. (G ∪ G').E = G.E ∪ G'.E,

. ∀T ∈ (G ∪ G').V, isAborted(T, G ∪ G') = isAborted(T, G) ∨ isAborted(T, G'),

. ∀T ∈ (G ∪ G').V, op(T, G ∪ G') = op(T, G) ∪ op(T, G').
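
A minimal sketch of a precedence graph and of the union just defined is given below. The class and field names are illustrative assumptions; transactions and operations are represented by their identifiers.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class PrecedenceGraph:
    V: Set[str] = field(default_factory=set)               # transactions
    E: Set[Tuple[str, str]] = field(default_factory=set)   # edges T -> T'
    aborted: Set[str] = field(default_factory=set)         # isAborted(T, G)
    op: Dict[str, Set[str]] = field(default_factory=dict)  # op(T, G)

    def union(self, other: "PrecedenceGraph") -> "PrecedenceGraph":
        """Component-wise union: vertices, edges, flags and op sets."""
        vertices = self.V | other.V
        return PrecedenceGraph(
            V=vertices,
            E=self.E | other.E,
            aborted=self.aborted | other.aborted,
            op={t: self.op.get(t, set()) | other.op.get(t, set())
                for t in vertices},
        )


g1 = PrecedenceGraph({"T1", "T2"}, {("T1", "T2")}, set(),
                     {"T1": {"w1"}, "T2": {"w2"}})
g2 = PrecedenceGraph({"T2", "T3"}, {("T2", "T3")}, {"T3"},
                     {"T2": {"r2"}, "T3": {"r3"}})
g = g1.union(g2)
```

Storing the aborted flags as a set makes the disjunction of the two flags a plain set union.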



4 Recall that the leader is eventual, consequently i may receive o more than one time. 
5 This operation prevents local deadlocks. 
Algorithm 1 decide(T, G), code for site i

    variable G' := (∅, ∅)                      ▷ a directed graph

    for all C ∈ cycles(G) do
        if ∀T' ∈ C, ¬isAborted(T', G) then
            G' := G' ∪ C
    if T ∈ breakCycles(G') then
        return false
    else
        return true



We say that G is a subset of G', noted G ⊆ G', if:
. G.V ⊆ G'.V ∧ G.E ⊆ G'.E,
. ∀T ∈ G.V, isAborted(T, G) ⇒ isAborted(T, G'),
. ∀T ∈ G.V, op(T, G) ⊆ op(T, G').

Let G be a precedence graph; in(T, G) (respectively out(T, G)) is the restriction of G
to the subset of vertices formed by T and its incoming (resp. outgoing) neighbors. The
predecessors of T in G, pred(T, G), is the precedence graph representing the transitive
closure of the dual of the relation G.E on {T}.
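
One possible reading of pred(T, G) is the sub-graph induced by T and every transaction from which T is reachable. The sketch below follows that reading, which is an assumption of the example; it ignores the aborted flags and the op sets, and the function name is illustrative.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]


def pred(t: str, edges: Set[Edge]) -> Tuple[Set[str], Set[Edge]]:
    """Sub-graph induced by t and every transaction with a path to t."""
    parents: Dict[str, Set[str]] = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)
    seen = {t}
    stack = [t]
    while stack:                      # backward traversal from t
        v = stack.pop()
        for u in parents.get(v, set()):
            if u not in seen:
                seen.add(u)
                stack.append(u)
    kept = {(u, v) for u, v in edges if u in seen and v in seen}
    return seen, kept


E = {("A", "B"), ("B", "C"), ("D", "C"), ("C", "E")}
vertices, sub_edges = pred("C", E)
```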

3.6 Deciding 

Each site i stores its own precedence graph G_i, and decides locally to commit or abort a
transaction according to it. More precisely i decides according to the graph pred(T, G_i). For
any cycle C in the set of cycles of pred(T, G_i), cycles(pred(T, G_i)), i must abort at least
one transaction in C. This decision is deterministic, and i tries to minimize the number of
transactions aborted.

Formally speaking, i solves the minimum feedback vertex set problem over the union of
all cycles in pred(T, G_i) containing only non-aborted transactions. The minimum feedback
vertex set problem is an NP-complete optimization problem, and the literature about this
problem is vast [6]. We consequently postulate the existence of a heuristic: breakCycles().
breakCycles() takes as input a directed graph G, and returns a vertex set S such that G \ S
is acyclic.

Now considering a transaction T ∈ G_i such that G = pred(T, G_i), Algorithm 1 returns
false if i aborts T, or true otherwise.
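
The decision procedure can be sketched as below. The heuristic shown (repeatedly removing the smallest transaction identifier still on a cycle) is a deliberately naive stand-in for breakCycles(); it is deterministic but makes no attempt at minimality. All function names are assumptions of the sketch.

```python
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]


def _successors(vertices: Set[str], edges: Iterable[Edge]) -> Dict[str, List[str]]:
    succ: Dict[str, List[str]] = {}
    for u, v in edges:
        if u in vertices and v in vertices:
            succ.setdefault(u, []).append(v)
    return succ


def _reachable(start: str, succ: Dict[str, List[str]]) -> Set[str]:
    """Vertices reachable from start through at least one edge."""
    seen: Set[str] = set()
    stack = [start]
    while stack:
        v = stack.pop()
        for w in succ.get(v, []):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def break_cycles(vertices: Set[str], edges: Set[Edge]) -> Set[str]:
    """Naive heuristic: drop the smallest id still lying on a cycle."""
    removed: Set[str] = set()
    while True:
        alive = vertices - removed
        succ = _successors(alive, edges)
        on_cycle = [v for v in sorted(alive) if v in _reachable(v, succ)]
        if not on_cycle:
            return removed
        removed.add(on_cycle[0])


def decide(t: str, vertices: Set[str], edges: Set[Edge],
           aborted: Set[str]) -> bool:
    """True if t may commit, False if t must abort."""
    live = vertices - aborted
    succ = _successors(live, edges)
    reach = {v: _reachable(v, succ) for v in live}
    # An edge (u, v) lies on a cycle of non-aborted transactions
    # exactly when u is reachable from v.
    cyc_edges = {(u, v) for u, v in edges
                 if u in live and v in live and u in reach[v]}
    cyc_vertices = {x for e in cyc_edges for x in e}
    return t not in break_cycles(cyc_vertices, cyc_edges)


V = {"T1", "T2", "T3"}
E = {("T1", "T2"), ("T2", "T1"), ("T2", "T3")}
```

In this example T1 and T2 form a cycle; the heuristic sacrifices T1, so T2 and T3 may commit. If T1 is already flagged as aborted, no cycle of live transactions remains.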

3.7 Closure phase 

In our model sites replicate data partially, and consequently maintain an incomplete view
of the precedence constraints linking transactions in the system. They therefore need to
complete their view by exchanging parts of their graphs. This is our closure phase:

. When i TO-delivers an operation o ∈ T, i adds T to its precedence graph, and adds o
to op(T, G_i). Then i sends pred(T, G_i) to replicas(out(T, G_i)) (line 29).

. When i receives a precedence graph G, if G ⊈ G_i, for every transaction T in G_i such
that pred(T, G) ⊈ pred(T, G_i), i sends pred(T, G ∪ G_i) to replicas(out(T, G_i)). Then i
merges G into G_i (lines 31 to 35).

Once i knows all the precedence constraints linking T to other transactions, we say that
T is closed at site i. Formally T is closed at site i when the following fixed-point equation
is true at site i:



Our closure phase ensures that during every run ρ, for every correct site i, and every
transaction T which is eventually in G_i, T is eventually closed at site i.

3.8 Commitment phase 

If T is a read-only transaction, i.e. wo(T) = ∅, i commits T as soon as T is executed (line 9).

If T is an update, i waits until T is closed and holds all its IW locks: function
holdIWLocks(). Once these two conditions hold, i computes decide(T, pred(T, G_i)). If this call
returns true, i commits T: for each write operation o ∈ wo(T) with i ∈ replicas(o), i
considers any write operation o' such that T→trans(o') ∈ G_i ∧ conflict(o, o'). If trans(o') is
already committed at site i, i does nothing; otherwise i applies o to its database.
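
The rule for applying the writes of a committed transaction can be sketched as follows. The argument `later_writers` is a hypothetical pre-computed map giving, for each item written by T, the transactions T' such that T→T' through a conflicting write on that item.

```python
from typing import Dict, Set


def apply_writes(writes: Dict[str, int],
                 later_writers: Dict[str, Set[str]],
                 committed: Set[str],
                 database: Dict[str, int]) -> None:
    """Apply the update values of a committing transaction.

    A write is skipped when a conflicting write ordered after it
    belongs to a transaction already committed at this site.
    """
    for item, value in writes.items():
        if later_writers.get(item, set()) & committed:
            continue                  # a later write is already installed
        database[item] = value


db = {"x": 0, "y": 0}
apply_writes({"x": 1, "y": 2}, {"x": {"T2"}}, {"T2"}, db)
```

Here the write on x is dropped because T2, ordered after the committing transaction on x, is already committed; the write on y is applied.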

Algorithm 2 describes our algorithm. This protocol provides serializability for partially
replicated database systems: any run of this protocol is equivalent to a run on a single site
[5]. The proof of correctness appears in the Appendix.

3.9 Initial execution on more than one site 

When the initial execution phase does not take place on a single site, we compute the read-from
dependencies. More precisely, when a site i receives a read o accessing a data item it does
not replicate, i sends o to some replica j ∈ replicas(o). Upon reception j executes o. At the
end of execution j sends back to i the transitive closure containing read-from dependencies
and starting from T.

Once i has executed locally or remotely all the read operations in T, i checks if the
resulting graph contains cycles in which T is involved. If this is the case, T will be aborted,
and instead of submitting it, i re-executes at least one of T's read operations. Otherwise i
computes the write set and the update values, and sends T with its read-from dependencies
by Uniform Reliable Multicast. The dependencies are merged into the precedence graph when a
site receives an operation by Total Order Multicast. The rest of the algorithm remains the
same.
Algorithm 2 code for site i

 1: variables G_i := (∅, ∅); pending := ∅
 2:
 3: loop                                                  ▷ Initial execution
 4:     let T be a new transaction
 5:     initialExecution(T)
 6:     if wo(T) ≠ ∅ then
 7:         R-multicast(T) to replicas(T)
 8:     else
 9:         commit(T)
10:
11: when R-deliver(T)                                      ▷ Submission
12:     for all o ∈ T : i ∈ replicas(o) do
13:         pending := pending ∪ {o}
14:
15: when ∃o ∈ pending ∧ i = WLeader(replicas(o))
16:     TO-multicast(o) to replicas(o)
17:
18: when TO-deliver(o) for the first time                  ▷ Certification
19:     pending := pending \ {o}
20:     let T = trans(o)
21:     G_i.V := G_i.V ∪ {T}
22:     op(T, G_i) := op(T, G_i) ∪ {o}
23:     if isRead(o) ∧ (∃o', o' →_i o ∧ trans(o') ∥ trans(o) ∧ trans(o') ∈ committed_i) then
24:         setAborted(T, G_i)
25:     else if isWrite(o) then
26:         forceWriteLock(o)
27:     for all o' : o' →_i o do
28:         G_i.E := G_i.E ∪ {(trans(o'), T)}
29:     send(pred(T, G_i)) to replicas(out(T, G_i))
30:
31: when receive(T, G)                                     ▷ Closure
32:     for all T ∈ G_i do
33:         if pred(T, G) ⊈ pred(T, G_i) then
34:             send(pred(T, G_i ∪ G)) to replicas(out(T, G_i))
35:     G_i := G_i ∪ G
36:
37: when ∃T ∈ G_i : i ∈ replicas(wo(T)) ∧ closed(T, G_i) ∧ holdIWLocks(T)    ▷ Commitment
38:     if ¬isAborted(T, G_i) ∧ decide(T, pred(T, G_i)) then
39:         commit(T)
40:     else
41:         abort(T)
3.10 Performance analysis

We consider Paxos [H] as a solution to Uniform Total Order Multicast. Since precedence
constraints in a cycle are not causally related, Algorithm 2 achieves a message delay of 5: 2
for Uniform Reliable Multicast, and 3 for Uniform Total Order Multicast. It reduces to 4
if, in each replica group, the leader of Paxos is also the weak leader of g.

Let o be the number of operations per transaction, and d be the replication degree. The
message complexity of Algorithm 2 is 5od + (od)²: 2od for Uniform Reliable Multicast, o
Uniform Total Order Multicasts, each costing 2d messages, and od replicas execute line 29,
each site sending od messages. Again, if in each replica group the leader of Paxos is also
the weak leader of g, the message complexity of our protocol reduces to 4od + (od)².
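
As a worked illustration, reading the two bounds as 5od + (od)² and 4od + (od)², the small helper below evaluates them for sample values. The function name and the sample values o = 4 and d = 3 are assumptions of the example.

```python
def message_complexity(o: int, d: int, shared_leader: bool = False) -> int:
    """Message count for o operations per transaction and replication degree d.

    The linear term is 5*o*d in general and 4*o*d when the Paxos leader
    is also the weak leader; the quadratic term (o*d)**2 comes from the
    exchange of precedence graphs.
    """
    linear = 4 if shared_leader else 5
    return linear * o * d + (o * d) ** 2


general = message_complexity(4, 3)                    # 5*12 + 144
optimized = message_complexity(4, 3, shared_leader=True)  # 4*12 + 144
```

For these values the quadratic term already dominates: 144 of the 204 messages come from the closure phase.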

4 Concluding remarks 

4.1 Related work 

Gray et al. [7] prove that traditional eager and lazy replication does not scale: the
deadlock rate increases as the cube of the number of sites, and the reconciliation rate increases
as the square. Wiesmann and Schiper confirm this result in practice [22]. Fritzke et al. [10]
propose a replication scheme where sites TO-multicast each operation and execute it
upon reception. However, they do not prevent global deadlocks with a priority rule, which
increases the abort rate. Preventive replication [TH] assumes that a bound on processor speed
and network delay is known. Such assumptions do not hold in a large-scale system. The
epidemic algorithm of Holiday et al. [3] aborts concurrent conflicting transactions, and their
protocol is not live in spite of one fault. In all of these replication schemes, each replica
executes all the operations accessing the data items it replicates. Alonso proves analytically
that this reduces the scale-up of the system pQ.

The DataBase State Machine approach [17] applies update values only, but in a fully
replicated environment. Its extensions [19, 21] to partial replication require a total order
over transactions.

Committing transactions using a distributed serialization graph is a well-known technique
[20]. Recently Haller et al. have proposed to apply it [5] to large-scale systems, but their
solution handles neither replication nor faults.

4.2 Conclusion 

We present an algorithm for replicating database systems in a large-scale system. Our solution
is live and safe in the presence of non-byzantine faults. Our key idea is to order conflicting
transactions per data item, then to break cycles between transactions. Compared to existing
solutions, ours either achieves lower latency and message cost, or does not
unnecessarily abort concurrent conflicting transactions.

The closure of constraint graphs is a classical idea in distributed systems. We may
find it in the very first algorithm about State Machine Replication [13], or in a well-known
algorithm to solve Total Order Multicast [S]. We believe that the closure generalizes to a
wider context, where a constraint is a temporal logic formula over sequences of concurrent
operations.

RR n° 6440

Sutra & Shapiro
.1 Additional notations

We note 𝒟 the universal set of data items, 𝒯 the universal set of transactions, and 𝒢 the
universal set of precedence graphs constructed upon 𝒟.

Let ρ be a run of Algorithm 2. Given a site i, we note event_i when the event "event" happens
at site i during ρ; moreover if value is the result of this event we note it: event_i = value.

Let ρ be a run of Algorithm 2; we note:

. faulty(ρ) the set of sites that crash during ρ,
. correct(ρ) the set Π \ faulty(ρ),
. committed(ρ) the transactions committed during ρ, i.e. {T ∈ 𝒯, ∃i ∈ Π, T ∈ committed_i},
. and aborted(ρ) the transactions aborted during ρ, i.e. {T ∈ 𝒯, ∃i ∈ Π, T ∈ aborted_i}.

Given a site i and a time t, we note G_{i,t} the value of G_i at time t.

.2 Proof of correctness 

Since the serializability theory is over a finite set of transactions, we suppose hereafter that
during ρ a finite subset of 𝒯 is sent to the system.

Let ρ be a run of Algorithm 2; we now prove a series of propositions leading to the fact
that ρ is serializable.

(P1)

∀T ∈ 𝒯, (∃j ∈ Π, R-deliver_j(T) ∈ ρ)
    ⇒ (∀o ∈ T, ∀i ∈ replicas(o) ∩ correct(ρ), TO-deliver_i(o) ∈ ρ)

Proof

Let T be a transaction and j a site that R-delivers T during ρ.

F1.1 ∀i ∈ replicas(T) ∩ correct(ρ), R-deliver_i(T)

    By the Uniform Agreement property of Uniform Reliable Multicast.

F1.2 ∀o ∈ T, ∃k ∈ correct(ρ) ∩ replicas(o), TO-multicast_k(o) ∈ ρ

    F1.2.1 ∃l ∈ correct(ρ) ∩ replicas(o), WLeader_l(replicas(o)) = l ∧ R-deliver_l(o)

        By fact F1.1, assumption A1 and the properties of the Eventual Weak
        Leader Service.

    By fact F1.2.1 eventually a correct site executes line 16 in Algorithm 2.

Fact F1.2 and the Validity and the Agreement properties of Total Order Multicast
conclude our claim.

□ 

In the following we say that a transaction T is submitted to the system, T ∈ submitted(ρ),
if a site i R-delivers T during ρ.

(P2)

∀T ∈ submitted(ρ), ∀i ∈ replicas(T), ∀o ∈ T,
    ∃G ∈ 𝒢, o ∈ op(T, G) ∧ receive_i(G)

Proof

F2.1 ∀i ∈ Π, ∀t, t', t > t' ⇒ G_{i,t'} ⊆ G_{i,t}

F2.2 ∀G ∈ 𝒢, ∀T ∈ 𝒯, T ∈ G ⇒ T ∈ pred(T, G)

    By definition of pred(T, G).

By proposition P1, facts F2.1 and F2.2, and since links are reliable.

□ 

(P3)

∀T, T' ∈ submitted(ρ),
    T→T' ⇒ (∃(o, o') ∈ T × T', ∃i ∈ correct(ρ), o →_i o')

Proof

By definition of T→T', let (o, o') ∈ T × T' and let j be a site such that o →_j o'. Since
conflict(o, o') and an operation applies on a single data item, we note x the unique data
item such that x = item(o) = item(o').

F3.1 j ∈ replicas(x)

    Site j TO-delivers o during ρ and links are reliable.

F3.2 ∃i ∈ replicas(x) ∩ correct(ρ), TO-deliver_i(o) ∧ TO-deliver_i(o')

    By assumption A1 ∃i ∈ replicas(x) ∩ correct(ρ), and by the Uniform Agreement
    property of Total Order Multicast, since i is correct during ρ, i TO-delivers both
    o and o'.

Fact F3.2 and the Total Order property of Total Order Multicast conclude our claim.

□ 




(P4)

∀T ∈ submitted(ρ), ∀i ∈ Π,
    (∃t, T ∈ G_{i,t}) ⇒ (∃T_1, …, T_{m≥0} ∈ submitted(ρ), i ∈ replicas(T_m) ∧ T → T_1 → … → T_m)

Proof

Since G_{i,0} = (∅, ∅), let us consider the first time t_0 at which T ∈ G_{i,t_0}.
According to Algorithm 2 either:

. i TO-delivers an operation o ∈ T at t_0, and thus i ∈ replicas(T). QED

. or i receives a precedence graph G' from a site j such that T ∈ G'. Now since links
are reliable, note t_1 the time at which j sends G' to i. According to lines 29 and 34 there
exists a transaction T' such that T ∈ pred(T', G_{j,t_1}), and a transaction T'' such that
T'' ∈ out(T', G_{j,t_1}) and i ∈ replicas(T'').

From T ∈ pred(T', G_{j,t_1}), by definition of the predecessors, we obtain T → … → T',
and from T'' ∈ out(T', G_{j,t_1}) we obtain T' → T''. Thus T → … → T' → T'', with
i ∈ replicas(T'').

□ 

(P5)

∀T ∈ submitted(ρ), ∀i ∈ correct(ρ),
    (∃t, T ∈ G_{i,t}) ⇒ (∃t, op(T, G_{i,t}) = T)

Proof

Let T_0 be a transaction submitted during ρ and let i be a site that eventually holds T_0 in
G_i. By proposition P4 there exist T_1, …, T_{m≥0} ∈ submitted(ρ) such that i ∈ replicas(T_m) and
T_0 → T_1 → … → T_m.

Let k ∈ [0, m]; we note P(k) the following property:

    P(k) ≜ ∀j ∈ correct(ρ) ∩ replicas(T_k), ∃t, op(T_0, G_{j,t}) = T_0

Observe that by proposition P2, P(0) is true. We now prove that P(k) is true for all
k by induction:

Let (o, o') ∈ T_k × T_{k+1}, and j ∈ correct(ρ) such that o →_j o'.
Let t_0 be the first time at which j TO-delivers o during ρ.
Let t_2 be the first time at which op(T_k, G_{j,t_2}) = T_k (since G_{j,0} = (∅, ∅), and P(k) is
true).
Let t_1 be the first time at which j TO-delivers o' during ρ.

Observe that since o →_j o', t_0 < t_1. It follows that we have three cases to consider:

. cases t_2 < t_0 < t_1 and t_0 < t_2 < t_1

In these cases when j TO-delivers o', we have:

    T_k → T_{k+1} ∈ G_{j,t_1} ∧ op(T_k, G_{j,t_1}) = T_k

Thus,

    T_k ∈ pred(T_{k+1}, G_{j,t_1}) ∧ op(T_k, pred(T_k, G_{j,t_1})) = T_k

and according to Algorithm 2, j sends pred(T_{k+1}, G_{j,t_1}) to replicas(out(T_{k+1}, G_{j,t_1})).

Now since replicas(T_{k+1}) ⊆ replicas(out(T_{k+1}, G_{j,t_1})), given a site j' ∈ replicas(T_{k+1}),
eventually j' receives pred(T_{k+1}, G_{j,t_1}), and merges it into its own precedence graph.

. case t_0 < t_1 < t_2

We consider two sub-cases:

— At t_2, j delivers an operation of T_k, and this operation is different from o'. Now
since T_k → T_{k+1} ∈ G_{j,t_2}, P(k + 1) is true.

— If now j receives a graph G such that op(T_k, G) = T_k, by definition of t_2, G ⊈
G_{j,t_2}, and more precisely, pred(T_k, G) ⊈ pred(T_k, G_{j,t_2}).

It follows that j sends pred(T_k, G ∪ G_{j,t_2}) to replicas(out(T_k, G ∪ G_{j,t_2})). Finally
since by definition of t_1, T_k → T_{k+1} ∈ G_{j,t_2}, we obtain T_{k+1} ∈ out(T_k, G ∪ G_{j,t_2}),
from which we conclude that P(k + 1) is true.

To conclude observe that since i ∈ replicas(T_m) and P(m) is true, eventually
op(T_0, G_{i,t}) = T_0.

□ 
(P6)

∀T ∈ submitted(ρ), ∀i ∈ correct(ρ),
    (∃t, T ∈ G_{i,t}) ⇒ (∃t, ∀T' ∈ submitted(ρ), T' → T ⇒ (T', T) ∈ G_{i,t})

Proof

F6.1 ∀T, T' ∈ submitted(ρ), ∀(o, o') ∈ T × T', (∃i ∈ Π, o' →_i o) ⇒ ∀j ∈ replicas(o), o' →_j o

    By the Uniform Agreement and Total Order properties of Total Order Multicast.

F6.2 ∀T ∈ submitted(ρ), ∀o ∈ T, ∀i ∈ correct(ρ),
    (∃t, o ∈ op(T, G_{i,t})) ⇒ (∀T' ∈ submitted(ρ), T' → T ⇒ ∃t, (T', T) ∈ G_{i,t})

    Since o ∈ op(T, G_{i,t}) and G_{i,0} = (∅, ∅), either:

    1. i ∈ replicas(T) ∧ TO-deliver_i(o)

    First observe that since links are reliable i ∈ replicas(o).

    Let T' be a transaction, o' ∈ T' an operation, and k a site such that o' →_k o.

    By fact F6.1 since i, k ∈ replicas(o), o' →_i o.

    2. ∃G ∈ 𝒢, receive_i(G) ∧ o ∈ op(T, G)

    According to Algorithm 2 there exist sites k_0, …, k_m such that:

    . k_0 TO-delivers o during ρ, and executes line 29, sending pred(T, G_{k_0}) with
    o ∈ op(T, pred(T, G_{k_0})) and k_1 ∈ replicas(out(T, G_{k_0})).

    . k_1 receives pred(T, G_{k_0}) during ρ and then executes line 29 or line 34,
    sending a precedence graph G such that pred(T, G_{k_0}) ⊆ G to a set of
    replicas containing k_2.

    . and so on, until i receives it.

    Consequently pred(T, G_{k_0}) ⊆ G_{i,t}, and according to our reasoning in item
    1, we conclude that fact F6.2 is true.

Fact F6.2 and proposition P5 conclude.

□ 

We are now able to prove our central theorem: every transaction is eventually closed
at a correct site.

(T1)

∀T ∈ submitted(ρ), ∀i ∈ correct(ρ),
    (∃t, T ∈ G_{i,t}) ⇒ (∃t, closed(T, G_{i,t}))

Proof

We consider that a finite subset of 𝒯 is sent to the system; consequently submitted(ρ)
is also finite. Let G_T be the graph resulting from the transitive closure of the relation
→ on {T}. According to proposition P6, G_T is eventually in G_{i,t}, and thus according to
proposition P5, T is eventually closed at site i.

□ 

(P7)

∀T ∈ submitted(ρ), ∀i, j ∈ Π, ∀t, t',
    (T ∈ G_{i,t} ∧ T ∈ G_{j,t'} ∧ closed(T, G_{i,t}) ∧ closed(T, G_{j,t'})) ⇒ (pred(T, G_{i,t}) = pred(T, G_{j,t'}))

Proof

F7.1 pred(T, G_i).V = pred(T, G_j).V

    Let T' ∈ pred(T, G_i). By definition there exist T_1, …, T_m such that T' → T_1 →
    … → T_m → T ⊆ G_i. By an obvious induction on m using proposition P6 we
    conclude that T' is also in pred(T, G_j).

F7.2 pred(T, G_i).E = pred(T, G_j).E

    Identical to the reasoning proposed for fact F7.1.

F7.3 ∀T' ∈ pred(T, G_i), op(T', pred(T, G_i)) = op(T', pred(T, G_j))

    By fact F7.1 and since T is closed at both sites i and j.

F7.4 {T' | isAborted(T', G_i)} = {T' | isAborted(T', G_j)}

    Let T' ∈ pred(T, G_i) such that isAborted(T', pred(T, G_i)). According to Algorithm 2
    there exists a site k and a read operation r ∈ T' such that k TO-delivers r
    during ρ, and then k sets the aborted flag of T' in its precedence graph.
    Now let k' be a replica of r; by the Uniform Agreement and the Total Order
    properties of Total Order Multicast, when k' TO-delivers r, it also sets the aborted
    flag of T' in its precedence graph.

By the conjunction of facts F7.1 to F7.4.

□ 

We now prove that ρ is serializable [2].

Let O(x, ρ) be the set of write operations over the data item x during ρ; we define the
relation ≪ as follows:

    ∀x ∈ 𝒟, ∀o_1, o_2 ∈ O(x, ρ), x_1 ≪ x_2 ≜ ∃i ∈ replicas(x), o_1 →_i o_2

(P8) ≪ is a version order for ρ.

Proof

Let O(x, ρ) be the set of write operations over the data item x during ρ, and let i ∈
replicas(x) ∩ correct(ρ) (assumption A1).

According to Algorithm 2, o is executed only if trans(o) is committed during ρ; consequently
i commits during ρ any transaction T such that ∃o ∈ wo(T) ∩ O(x, ρ). Consequently
≪ is total over O(x, ρ), and by the Total Order and Uniform Agreement properties of Total
Order Multicast, ≪ is an order over O(x, ρ).

□ 




(P9)

∀T, T' ∈ MVSG(ρ, ≪),
    ((T, T') ∈ MVSG(ρ, ≪) ∧ wo(T) ≠ ∅ ∧ wo(T') ≠ ∅) ⇒ T → T'

Proof

F9.1 If (T, T') is a read-from edge, then T → T'

    Let (T, T') be a read-from relation. By definition there exists a site i, a write w[x] ∈ T,
    and a read r[x] ∈ T' such that during ρ at site i, w writes x then r reads the value
    written by w.

    Let t and t' be respectively the times at which these two events occurred; according to
    Algorithm 2:

    F9.1.1 TO-deliver_i(o) <_ρ t <_ρ t'

    Then since T' ∈ MVSG(ρ, ≪), T' ∈ submitted(ρ), and by assumption A1, there exists
    j ∈ replicas(x) ∩ correct(ρ) such that TO-deliver_j(o').

    Now, TO-deliver_i(o) ⇒ TO-deliver_j(o) by the Uniform Agreement and the Total
    Order properties of Total Order Multicast. Consequently using fact F9.1.1,

        ¬(TO-deliver_i(o') <_ρ TO-deliver_i(o)) ⇒ TO-deliver_j(o) <_ρ TO-deliver_j(o')

    concluding our claim.

F9.2 If (T, T') is a version-order edge, then T → T'

    Let T_1, T_2, T_3 be three transactions committed during ρ, and suppose that there exists a
    version-order edge (T_1, T_2) ∈ MVSG(ρ, ≪).

    According to the definition of a version order it follows either:

    1. there exist w_1 ∈ wo(T_1), w_2 ∈ wo(T_2), and r_3 ∈ ro(T_3) such that r_3[x_2], w_1[x_1] and
    x_1 ≪ x_2.

    By definition of ≪, x_1 ≪ x_2 ⇒ T_1 → T_2.

    2. there exist r_1 ∈ ro(T_1), w_2 ∈ wo(T_2) and w_3 ∈ wo(T_3) such that r_1[x_3], w_2[x_2] and
    x_3 ≪ x_2.

    Let i ∈ replicas(x) ∩ correct(ρ) (by assumption A1). Since T_1, T_2, T_3 ∈ committed(ρ) ⊆
    submitted(ρ), i TO-delivers r_1, w_2 and w_3 during ρ. Now according to the Total
    Order property of Total Order Multicast, since x_3 ≪ x_2, w_3 →_i w_2.
    Let j be a site on which r_1[x_3] happens. Since w_3 →_i w_2, according to our definition
    of commit() (Section 3.8), w_2 ≺ r_1.

    Now since T_1 ∈ committed(ρ), necessarily r_1 →_i w_2 (otherwise T_1 is aborted by the
    certification test of Algorithm 2).

By facts F9.1 and F9.2.

□ 

(P10) ρ is serializable.

Proof

Consider the sub-graph G_u of MVSG(ρ, ≪) containing all the transactions T such that
wo(T) ≠ ∅, and the edges linking them.

F10.1 G_u is acyclic.

    Let T_1, …, T_m ∈ G_u such that T_1, …, T_{m≥1} form a cycle in G_u, and recall that
    by definition T_1, …, T_m ∈ committed(ρ).

    According to proposition P9, T_1 → … → T_m → T_1.

    Let i be a replica of T_1, and note t the time at which i commits T_1 during ρ.

    According to Algorithm 2, at time t, closed(T_1, G_{i,t}).

    Now according to Algorithm 1 and since i commits T_1 at time t,

        ∃k ∈ [2, m], T_k ∈ breakCycles(pred(T_1, G_{i,t}))

    Let j ∈ replicas(T_k) such that j commits T_k during ρ, and let t' be the time at
    which this event happens.

    Since T_1 ∈ pred(T_k, G_{j,t'}) and T_k ∈ pred(T_1, G_{i,t}), pred(T_1, G_{i,t}) = pred(T_k, G_{j,t'}).
    Consequently since breakCycles() is deterministic, j cannot commit T_k during ρ.
    Absurd.

F10.2 MVSG(ρ, ≪) is acyclic.

    By fact F10.1 and since read-only transactions are executed using two-phase
    locking.

Fact F10.2 implies that ρ is serializable.

□ 